Abstract: A generalization of a known class of parallel sorting algorithms is presented, together with a new architecture to execute them. A VLSI implementation is also proposed, and its area-time performance is discussed. It is shown that an algorithm in the class is executable in $O(\log n)$ time by a chip occupying $O(n^2)$ area. The design is a typical instance of a "hybrid architecture", resulting from the combination of well-known VLSI arrays such as the orthogonal trees and the cube-connected cycles; it is also the first known to meet the $AT^2 = \Omega(n^2 \log^2 n)$ lower bound for sorters of $n$ words of length $(1+\epsilon)\log n$ ($\epsilon > 0$) while working in minimum $O(\log n)$ time.

1. Introduction


Sorting is one of the most widely studied problems from the computational point of view, and many algorithms have been proposed for its solution. Since the possibility of parallel computation has been considered, several parallel schemes have also been proposed for sorting. Different models for parallel computers are possible, and several have been considered in the literature in past years. Recently, the advent of Very Large Scale Integrated (VLSI) circuits has motivated the definition of a new model of computation that aims at capturing the essential features of the new technology. Naturally, sorting has been one of the first problems studied in the VLSI environment, and several results are already available. In particular, Thompson [1] gives a survey of thirteen algorithms for sorting and discusses their performance in terms of the chip area $A$ and of the time $T$ that elapses between beginning and completion of a computation. Indeed, area and time are natural measures of complexity for VLSI circuits, reflecting production cost and incremental cost, respectively.

A theoretical argument, due to Thompson [2], shows that any sorter of $n$ terms, with wordlength $q = (1+\epsilon)\log n$, $\epsilon > 0$, must satisfy the relationship $AT^2 = \Omega(n^2 \log^2 n)$. The argument is based on the facts that any chip that sorts must support a flow of $\varphi = \Omega(n \log n)$ bits through a suitable bisection, and that $AT^2 = \Omega(\varphi^2)$. This lower bound holds in a suitable VLSI model of computation whose basic assumptions are that the chip is synchronous (transmission time is independent of wire length) and semellective-unilocal (input data are read only once, at prespecified input ports). A word-local restriction is also assumed for the input format (all the bits of the same word enter the circuit at the same point).
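As a worked instance of this bound, the two facts quoted above can be chained as follows. This is only a sketch that substitutes the flow into the area-time relation; the last line additionally assumes a sorter running in $T = O(\log n)$.

```latex
\begin{align*}
\varphi &= \Omega(n \log n), \qquad AT^2 = \Omega(\varphi^2) \\
\Longrightarrow\quad AT^2 &= \Omega(n^2 \log^2 n) \\
\Longrightarrow\quad A &= \Omega\!\left(\frac{n^2 \log^2 n}{T^2}\right) = \Omega(n^2)
  \quad \text{when } T = O(\log n).
\end{align*}
```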

In a previous paper [3] we have shown that optimal VLSI sorters can indeed be constructed for all computation times $T \in [\Omega(\log^3 n), O(\sqrt{n \log n})]$. These sorters are based on a new architecture, the Pleated-Cube-Connected-Cycles (PCCC), and execute bitonic sorting [4].


In this paper we concentrate on "very fast" sorting, i.e., the class of VLSI sorting algorithms whose running time is $T = O(\log n)$. So far only one VLSI design is known to achieve $O(\log n)$ computation time: it is based on the orthogonal trees architecture [5], [6] and implements an algorithm due to Muller and Preparata [7].^1 The optimal layout of the orthogonal trees has area $A = O(n^2 \log^2 n)$ [6], while the lower bound yields $A = \Omega(n^2)$ for $T = O(\log n)$. On the other hand, a closer analysis of the algorithm shows that the information flow $\varphi$ is $O(n \log n)$, so that the gap between upper and lower bounds is not due to a gap between the actual flow and a flow-based lower bound; rather, it is due to the fact that the length of the layout bisection of the orthogonal trees is $O(\log n)$ times as large as the graph bisection.

We will show in this paper that not only the lower bound on the flow, but also the one on the $AT^2$ measure is tight, by exhibiting a new architecture capable of sorting in area $A = O(n^2)$ and time $T = O(\log n)$.

The rather complex network is a typical instance of the "hybrid architecture", resulting from the careful interplay of more standard VLSI networks, such as the cube-connected-cycles machine, the mesh-connected machine, and the binary-tree machine. The implemented algorithm is of the type first introduced by Preparata [8], although the recursion strategy has been modified to optimize the network area.

A slight modification of one of the building blocks of our sorter turns out to be an interesting network in its own right. It is called the mesh of CCC's, and is a powerful emulator of the binary cube, matching the performance of both the CCC and the PCCC machines.

A suitable combination of one $O(\log n)$ sorter and one mesh of CCC's of proper size will allow us to construct an $AT^2$-optimal sorter for any computation time $T \in [\Omega(\log n), O(\log^3 n)]$.

^1 Subsequent to the research leading to this paper, we learned of the construction of Ajtai, Komlós, and Szemerédi [13], which also achieves $O(\log n)$ time; in addition we have devised a VLSI implementation of their


Thus we are able to conclude that optimal $AT^2 = O(n^2 \log^2 n)$ sorting is achievable in the entire "meaningful" range of computation times $T \in [\Omega(\log n), O(\sqrt{n \log n})]$. (Simple fan-in arguments show that $\Omega(\log n)$ is a lower bound for the computation time, and $A = \Omega(n \log n)$ is an immediate consequence of the semellective assumption, so that computation times slower than $\Theta(\sqrt{n \log n})$ cannot result in smaller area.)

In Section 2 we introduce a general framework for sorting algorithms, called COMBINE-SORT, which is based on an operation, COMBINATION, generalizing the operation of MERGING from two to several sequences. Sections 3 and 4 describe an architecture (COMBINER) and an algorithm for COMBINATION, respectively. The combiner network so obtained is then used in Section 5 as a building block for a general class of COMBINATION-sorters. One of these sorters is shown to have optimal area $A = O(n^2)$ for $T = \Theta(\log n)$ computation time. Finally, in Section 6, we discuss the area-time trade-off for sorting, and show that optimal sorters can be constructed for any computation time within the above range.


2.  A  Class  of  Parallel  Sorting  Algorithms 


Several sorting algorithms can be viewed as particular cases of a rather general scheme, which we now describe.

We call COMBINATION the operation that produces, from $m$ sorted sequences of $t$ elements each, one sorted sequence of $mt$ elements. A network implementing this operation is called an $(m,t)$-COMBINER. When $m = 2$, COMBINATION reduces to merging.

A parallel algorithm for the $(m,t)$-COMBINER has been introduced in [8], and is based on the following idea. The $m$ input sequences $S_0, \ldots, S_{m-1}$ are pairwise merged to compute, for each $i, j \in \{0, 1, \ldots, m-1\}$ and each $\ell \in \{0, 1, \ldots, t-1\}$, the number $C_{ij}(\ell)$ of elements of sequence $S_j$ that are less than the $\ell$-th element of sequence $S_i$. $C_{ij}(\ell)$ is readily obtained as the difference of the ranks of this element in the merge of $S_i$ and $S_j$ and in $S_i$. By summing the $C_{ij}(\ell)$'s over $j$ we then obtain the rank of the $\ell$-th element of $S_i$ in the output sequence of the COMBINER; thus, to complete the operation, we simply need to store each element in the position specified by its rank. The primitive operation of the scheme, the merging of two sequences, can be done, for example, by Batcher's bitonic merger [4].
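The rank-based idea above can be sketched in a few lines of Python. This is an illustrative sequential model, not the network itself: the function name `combine` is hypothetical, and binary search (`bisect`) stands in for the pairwise merges that the scheme uses to obtain the counts $C_{ij}(\ell)$. Ties are broken by sequence index (strict comparison when $i < j$, non-strict when $i > j$), so that equal keys receive distinct ranks.

```python
from bisect import bisect_left, bisect_right


def combine(seqs):
    """Combine m sorted sequences of t elements each by rank computation.

    Hypothetical helper: for every element s_i(l) the partial ranks C_ij(l)
    are summed over j, and the element is stored at the resulting position.
    """
    m = len(seqs)
    t = len(seqs[0])
    if any(len(s) != t for s in seqs):
        raise ValueError("all sequences must have the same length t")
    out = [None] * (m * t)
    for i, s_i in enumerate(seqs):
        for l, x in enumerate(s_i):
            rank = l  # C_ii(l): elements of S_i that precede s_i(l)
            for j, s_j in enumerate(seqs):
                if j == i:
                    continue
                # Elements of S_j counted before x; the tie rule depends on i, j.
                rank += bisect_left(s_j, x) if i < j else bisect_right(s_j, x)
            out[rank] = x
    return out
```

With $m = 2$ the function behaves as a merge of two sorted sequences, matching the remark that COMBINATION reduces to merging in that case.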

Given $n = m_1 m_2 \cdots m_d$ elements, we can sort them in $d$ stages according to the following scheme, which we call COMBINE-SORT.

At stage 1 we perform $n/m_1$ combination operations, each on $m_1$ sequences of 1 element each. At stage 2 we perform $n/(m_1 m_2)$ combinations, each on $m_2$ sequences of $m_1$ elements each, and at stage $i$ we perform $n/(m_1 \cdots m_i)$ combinations, each on $m_i$ sequences of length $m_1 \cdots m_{i-1}$. Finally, at stage $d$ we combine $m_d$ sequences of length $n/m_d$ into one sequence of length $n$, which is the output of the COMBINE-SORT scheme.

A diagrammatic illustration of the scheme is given in Figure 1 in the form of a rooted tree. Each node of this tree is a suitable combiner. An $(m_i, t_{i-1})$-COMBINER, $1 \le i \le d$, performs the combination of $m_i$ (sorted) sequences of length $t_{i-1}$; here $t_0 = 1$ and $t_i = m_i t_{i-1}$ for $i \ge 1$. Note that each level of the tree corresponds to a stage of the combination scheme, and that there are $n_i = n/t_i$ nodes at level $i$, $1 \le i \le d$.
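The staged scheme can be modeled sequentially as follows. This is a minimal sketch: `combine_sort` and its `factors` argument are hypothetical names, and the standard-library `heapq.merge` is used as a stand-in for an $(m_i, t_{i-1})$-COMBINER, since only the stage structure is being illustrated.

```python
import heapq


def combine_sort(items, factors):
    """Sort `items` in len(factors) stages, where n = m_1 * m_2 * ... * m_d.

    At stage i, consecutive groups of m_i sorted sequences are combined.
    heapq.merge plays the role of the combiner (an assumption of this sketch).
    """
    n = len(items)
    product = 1
    for m in factors:
        product *= m
    if product != n:
        raise ValueError("the factors must multiply to the number of items")
    seqs = [[x] for x in items]  # t_0 = 1: n sequences of one element each
    for m in factors:
        seqs = [
            list(heapq.merge(*seqs[k:k + m]))
            for k in range(0, len(seqs), m)
        ]
    return seqs[0]
```

Calling `combine_sort(data, [2, 2, 2])` on eight items follows the MERGE-SORT pattern, while `combine_sort(data, [8])` uses a single combination of eight one-element sequences.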
Figure 1. Diagram of COMBINE-SORT scheme.

Several known sorting algorithms can be cast in the COMBINE-SORT scheme. Each algorithm is characterized by a particular factorization $n = m_1 \cdots m_d$ (note that the order of the factors is relevant here), and by the specification of how the combination is to be performed. In particular, if we use the COMBINER based on [8] we have the following cases.

(i) When $n = 2^d$ and $m_1 = m_2 = \cdots = m_d = 2$, COMBINE-SORT reduces to the usual MERGE-SORT.

(ii) When $d = 1$ and $m_1 = n$, COMBINE-SORT reduces to a single $(n,1)$-COMBINER, which is essentially the sorting network described in [7].

(iii) When $d = \log\log n / \log(1/(1-\alpha))$ and $m_{d-i} = n^{\alpha(1-\alpha)^i}$ with $0 < \alpha < 1$, we obtain the sorting schemes described in [8]. The sorting scheme corresponding to a given $\alpha$ can be described as follows. The $n$-input sequence is split into $n^{\alpha}$ ($m_d$ in our terminology) sequences of $n^{1-\alpha}$ ($t_{d-1}$ in our terminology) elements each. These sequences are sorted recursively, and then combined by an $(m_d, t_{d-1})$-COMBINER. The recursion stops when sequences of length 1 are obtained. We can obtain the values for $d$ and $m_{d-i}$ by a simple analysis of the unfolded recursive process.
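The unfolding mentioned in case (iii) can be written out as follows. This is a sketch under two assumptions that are not spelled out in the text: the recursion is taken to stop when the subsequence length has dropped to a constant (2 here), and the integrality of the exponents is ignored. Logarithms are to base 2.

```latex
\begin{align*}
t_d &= n, \qquad m_d = n^{\alpha}, \qquad t_{d-1} = n^{1-\alpha}, \\
t_{d-i} &= n^{(1-\alpha)^{i}}, \qquad
m_{d-i} = t_{d-i}^{\,\alpha} = n^{\alpha(1-\alpha)^{i}}, \\
n^{(1-\alpha)^{d}} = 2
  &\iff (1-\alpha)^{d} \log n = 1
  \iff d = \frac{\log\log n}{\log\bigl(1/(1-\alpha)\bigr)} .
\end{align*}
```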

In the following sections, we shall explore which other choices of $d$ and $m_1, m_2, \ldots, m_d$ can be made to minimize the complexity of a VLSI implementation of COMBINE-SORT.
3. An (m,t)-COMBINER Network

In this section we propose a parallel architecture for an $(m,t)$-COMBINER, where $m = 2^{\mu}$ and $t = 2^{\tau}$ are powers of two. This architecture will accept as input $m$ sorted sequences of $t$ elements each,

$S_i = (s_i(0), s_i(1), \ldots, s_i(t-1)), \quad i = 0, 1, \ldots, m-1,$

and produce as output a single sorted sequence $S$, which is the combination of $S_0, \ldots, S_{m-1}$ and has $N = mt = 2^{\nu}$ elements,

$S = (s(0), s(1), \ldots, s(N-1)).$

The $(m,t)$-COMBINER will execute the algorithm based on pairwise merging as outlined in the preceding section. Its organization is illustrated in Figure 2. It consists of $m^2$ modules (each capable of merging two sequences of length $t$ and of computing partial ranks), laid out as a square $m \times m$ mesh and indexed as $M_{ij}$ ($i, j = 0, 1, \ldots, m-1$). The modules of each row are interconnected as the leaves of a binary tree of bandwidth $t$; so are the modules of each column. Thus, the combiner has the structure of the orthogonal-trees machines [3,6], whose leaves are merging modules. The interconnecting trees have the following functions:

(i) to broadcast a sequence to all units in which it must be merged with some other sequence;

(ii) to compute global ranks from partial ranks;

(iii) to rearrange the elements according to their ranks into the sorted sequence $S$.

We will now describe in some detail the merging modules and the interconnecting trees.




3.1. Merging Modules

Merging module $M_{ij}$ will merge sequences $S_i$ and $S_j$ and compute $C_{ij}(\ell)$, for $\ell = 0, 1, \ldots, t-1$. We recall that $C_{ij}(\ell)$ is the number of elements of $S_j$ that are less than (respectively, less than or equal to) $s_i(\ell)$ when $i \le j$ (when $i > j$). Each module is realized (Figure 3) as a cube-connected-cycles (CCC) interconnection of smaller processing elements, called micromodules (each micromodule has a bandwidth of 1 bit). Specifically, the merging module is a $(\tau+1, 2^{\tau+1})$-CCC (i.e., it has $2^{\tau+1}$ cycles each of length $\tau+1$). We number the micromodules of $M_{ij}$ as $M_{ij}(h,k)$, with $0 \le h < \tau+1$ and $0 \le k < 2^{\tau+1}$, so that the merging module may be thought of as a $(\tau+1) \times 2^{\tau+1}$ array (rows are numbered from bottom to top, columns from left to right). The columns of this array are connected as cycles, with a link between $M_{ij}(h,k)$ and $M_{ij}((h+1) \bmod (\tau+1), k)$. The rows $0, 1, \ldots, \tau$ are associated with the dimensions $E_0, E_1, \ldots, E_{\tau}$ of a $(\tau+1)$-dimensional binary cube [9], and there is a link between $M_{ij}(h,k_1)$ and $M_{ij}(h,k_2)$ if and only if the binary expansions of $k_1$ and $k_2$ differ exactly in the coefficient of $2^h$.

The reader is referred to [10] for a detailed explanation of the CCC; the reader should also be warned that in this paper we will not use the CCC at its full capability, since we deploy a network with $2t \log 2t$ (rather than $2t$) micromodules to merge two sequences of length $t$. In other words, a $2^{\tau+1}$-node binary cube is emulated by a $(\tau+1, 2^{\tau+1})$-CCC.^2 When the $2^{\tau+1}$ items on which we operate have to be processed on the cube dimension $E_h$, we just need to guarantee that the items are in row $h$ of the CCC. Thus, execution of the ASCEND and DESCEND paradigms, in which the dimensions are used in the sequence $(E_0, E_1, \ldots, E_{\tau})$ and $(E_{\tau}, E_{\tau-1}, \ldots, E_0)$ respectively, is quite straightforward.

The layout of a $(3, 2^3)$-CCC in Figure 3 shows two sets of 4 input lines (denoted, respectively, as RT- and CT-lines), each carrying one of the two 4-element sequences to be merged.
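To make the interconnection pattern concrete, the following Python sketch builds the link structure of a CCC indexed as above. The function name `build_ccc` is hypothetical, and the sketch assumes that the cycle links join consecutive rows within a column and that row $h$ carries the cube links of dimension $E_h$; rows beyond the number of cube dimensions, if any, carry cycle links only.

```python
def build_ccc(cycle_len, num_cycles):
    """Return {node: set of neighbours} for a (cycle_len, num_cycles)-CCC.

    Nodes are pairs (h, k): row h within the cycle, column (cycle) k.
    num_cycles must be a power of two, 2**dims, with cycle_len >= dims.
    """
    dims = num_cycles.bit_length() - 1
    if num_cycles < 1 or num_cycles != 1 << dims:
        raise ValueError("num_cycles must be a power of two")
    if cycle_len < dims:
        raise ValueError("each cycle needs one row per cube dimension")
    adjacency = {}
    for h in range(cycle_len):
        for k in range(num_cycles):
            # Cycle links: neighbours above and below in the same column.
            neighbours = {((h + 1) % cycle_len, k), ((h - 1) % cycle_len, k)}
            # Cube link of dimension E_h: flip bit h of the column index.
            if h < dims:
                neighbours.add((h, k ^ (1 << h)))
            neighbours.discard((h, k))
            adjacency[(h, k)] = neighbours
    return adjacency
```

For the $(3, 2^3)$-CCC of Figure 3, `build_ccc(3, 8)` yields 24 micromodules, each with two cycle links and one cube link.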


^2 For this reason the number of micromodules in a cycle is not constrained to be a power of 2.

Figure 3. Merging unit $M_{ij}$ realized by a $(3, 2^3)$-CCC, used to merge two sequences with four elements each.


Recalling from [10] that a CCC with $N$ processing elements of constant area can be laid out in area $O(N^2/\log^2 N)$, we conclude that a merging module can be laid out in area $O(A_0 t^2)$, where $A_0$ is the area of the micromodule. In the next section, in connection with the COMBINATION algorithm, we shall specify the functional capabilities of each micromodule, from which it will be clear that $A_0$ is constant, i.e., independent of the problem size.

3.2. Interconnecting Trees

As indicated earlier, the merging modules are interconnected by two families of $N = mt$ complete binary trees with $m = 2^{\mu}$ leaves and bandwidth 1. We will refer to these families as the row trees and column trees.

The lines of the row trees and the column trees are respectively labelled $RT_i(\ell)$ and $CT_j(\ell)$, $i, j = 0, \ldots, m-1$; $\ell = 0, \ldots, t-1$. The trees and the merging modules are connected through a small interface, whose structure will be fully specified in connection with the description of the COMBINATION algorithm in the next section. At this point we just say that the leaves of $RT_i(\ell)$ are, from left to right, connected to the CCC micromodules $M_{i0}(0,\ell), M_{i1}(0,\ell), \ldots, M_{i,m-1}(0,\ell)$; the leaves of $CT_j(\ell)$ are connected to the CCC micromodules $M_{0j}(0,t+\ell), M_{1j}(0,t+\ell), \ldots, M_{m-1,j}(0,t+\ell)$; in other words, the row trees and the column trees are respectively connected to the RT and the CT lines of the merging modules. The connection between each leaf of a tree and the corresponding CCC micromodule is realized through a buffer register of the appropriate size (adequate to store one element to be sorted). The situation is illustrated in Figure 4.

Figure 4. Interconnection of modules and trees.
4. The COMBINATION Algorithm

We now describe how the sorting algorithm of [8], based on pairwise merging, can be executed on the architecture introduced in Section 3. This analysis will elucidate the structure of the CCC micromodules. We recall that the inputs are $m = 2^{\mu}$ sorted sequences of $t = 2^{\tau}$ elements each, with $N = 2^{\nu} = mt$.

For convenience we split the algorithm into several phases.

(A) Input of Data and Broadcasting to Merging Modules

Element $s_i(\ell)$ is input at the root of tree $RT_i(\ell)$, and is then broadcast to all leaves of the tree. At this point, the left half of row(0) in module $M_{ij}$ contains the sequence $S_i$. To fill the right halves of row(0) of all modules, we proceed as follows. First, in each "diagonal" module $M_{jj}$ the sequence $S_j$ is copied into the second half of row(0). (This can be done by using the connection of row($\tau$) between the left and the right half of the machine.) Next, from micromodule $M_{jj}(0,t+\ell)$, which is a leaf of $CT_j(\ell)$, element $s_j(\ell)$ is broadcast (through the root) to all the other leaves of the same tree. At this point, the merging module $M_{ij}$ contains $S_i$ and $S_j$ in the 0-th row and merging can begin.

(B) Merging and Partial Rank Computation

Merging can be executed by resorting to the bitonic algorithm, which complies with the DESCEND paradigm of the binary cube (see [10]). However, in order to execute bitonic merging, we first need to reverse the order of $S_j$. This is accomplished by an ASCEND algorithm in which columns $t$ to $2t-1$ of each $M_{ij}$ exchange their data at dimensions $E_0, \ldots, E_{\tau-1}$ while columns $0$ to $t-1$ remain idle. All the columns are idle at dimension $E_{\tau}$.

Now the data are ready for bitonic merging. At each dimension $E_{\tau}, E_{\tau-1}, \ldots, E_1, E_0$, pairs of elements are compared and exchanged, if necessary, to place the smaller of the two in the column with the smaller number. Each processor (micromodule) of the merging module is equipped with a serial comparator that reads the inputs starting from the most significant bit. As long as the two inputs agree, they are transmitted to the next processor in the same column. As soon as a bit discrepancy is detected, a switch is set and, from then on, the remaining substrings of the operands follow a fixed path, independently of their value. It is then easy to see how the computation through the rows of the CCC can be naturally pipelined to achieve a computation time of $O(\tau+q)$, where $q$ is the length of the input words. At the end of merging, the result resides in row(0) of the CCC, and the element in $M_{ij}(0,\ell)$, $0 \le \ell \le 2t-1$, has rank $\ell$ in MERGE($S_i$,$S_j$). Now we want to transmit the ranks of $s_i(0), \ldots, s_i(t-1)$ to processors $M_{ij}(0,0), \ldots, M_{ij}(0,t-1)$, respectively.

This is accomplished by retracing backwards the path traversed by each element $s_i(\ell)$, and is easily done if each micromodule keeps track of whether or not it exchanged the operands during the merging process. So, all we have to do is to run the machine backwards, with an ASCEND algorithm, which applies to the ranks the inverse of the permutation that merged the elements. At the end of this phase, processor $M_{ij}(0,\ell)$, $0 \le \ell \le t-1$, stores the number of elements in MERGE($S_i$,$S_j$) that are less than $s_i(\ell)$. If from this number we subtract $\ell$ we obtain $C_{ij}(\ell)$, the number of elements of $S_j$ which are less than $s_i(\ell)$. We call the $C_{ij}$'s partial ranks because from them we can compute the rank of each $s_i(\ell)$ in the sorted sequence $S$ as

$C_i(\ell) = \sum_{j=0}^{m-1} C_{ij}(\ell).$
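The merging step can be illustrated by a sequential simulation of the compare-exchange pattern. This is a sketch only: the name `bitonic_merge` is hypothetical, the list index plays the role of the column number in row(0), and the reversal of the second sequence is done directly rather than by an ASCEND pass.

```python
def bitonic_merge(s_i, s_j):
    """Merge two sorted sequences of length t = 2**tau with a DESCEND pass.

    The second sequence is reversed so that the 2t items form a bitonic
    sequence; dimensions E_tau, ..., E_0 are then used in decreasing order.
    """
    t = len(s_i)
    if len(s_j) != t or t == 0 or t & (t - 1):
        raise ValueError("both sequences must have the same power-of-two length")
    data = list(s_i) + list(reversed(s_j))
    stride = t  # 2**tau: partner columns differ in the bit of dimension E_tau
    while stride >= 1:
        for k in range(2 * t):
            if k & stride == 0:
                partner = k | stride
                # Place the smaller element in the lower-numbered column.
                if data[k] > data[partner]:
                    data[k], data[partner] = data[partner], data[k]
        stride //= 2
    return data
```

The hardware compares bit-serially and records the exchange decisions so that ranks can be routed back; the simulation above omits both aspects and only reproduces the sequence of dimensions used.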

(C) Total Rank Computation

It is immediate to see that at the end of phase B the partial ranks $C_{i0}(\ell), C_{i1}(\ell), \ldots, C_{i,m-1}(\ell)$ of $s_i(\ell)$ are available exactly at the leaves of row tree $RT_i(\ell)$. By having in each internal node of the tree a full adder with a 1-bit delay feedback on the carry, we can then obtain at the root of $RT_i(\ell)$ the sum $C_i(\ell)$ of the values stored at the leaves. The nodes work as serial adders and the tree is used in a pipelined fashion, so that the time required is $O(\mu+\tau)$, where $\mu = \log m$ is the depth of the tree, and $\tau+1$ is the wordlength of the operands (note that $C_{ij}(\ell) \le 2^{\tau}$). Within the same order of time, we can subsequently broadcast $C_i(\ell)$ from the root to the leaves. (Indeed $C_i(\ell) < 2^{\tau+\mu}$, so it can be expressed by $\tau+\mu$ bits.)
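The role of the serial adders can be illustrated with a small bit-level model. This is a sketch, with hypothetical names (`serial_add`, `tree_sum`), that assumes operands are streamed least significant bit first and that the caller supplies a word width large enough to hold the total; pipelining across tree levels is not modeled.

```python
def to_bits(value, width):
    """Bits of `value`, least significant first."""
    return [(value >> b) & 1 for b in range(width)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << position for position, bit in enumerate(bits))


def serial_add(a_bits, b_bits):
    """One tree node: a full adder whose carry is fed back with 1-bit delay."""
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out


def tree_sum(partial_ranks, width):
    """Sum the values at the leaves of a complete binary tree of serial adders."""
    if len(partial_ranks) & (len(partial_ranks) - 1) or not partial_ranks:
        raise ValueError("the number of leaves must be a power of two")
    level = [to_bits(v, width) for v in partial_ranks]
    while len(level) > 1:
        level = [serial_add(level[k], level[k + 1])
                 for k in range(0, len(level), 2)]
    return from_bits(level[0])
```

For $m = 4$ and $t = 4$, a width of $\tau + \mu + 1 = 5$ bits is sufficient for any total rank, which is the setting used in the checks below.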

(D) Sorting Permutation and Output of Data

We want to output the elements $s(0), \ldots, s(N-1)$ of the sorted sequence from the roots of the column trees, and, specifically, we want the root of $CT_j(\ell)$ to output element $s(j2^{\tau}+\ell)$. This corresponds to a natural left-to-right order of the column trees as they appear in the layout of Figure 2.

Considering a generic element $s_i(h)$ with rank $C_i(h)$, the binary spellings of the integers $j$ and $\ell$ such that $s_i(h)$ will emerge from the root of column tree $CT_j(\ell)$ are readily obtained by taking the $\mu$ most significant bits and the $\tau$ least significant bits of the rank $C_i(h)$ to represent $j$ and $\ell$, respectively. Thus, as a first step, we "activate" in $M_{ij}$ the elements of sequence $S_i$ that have to emerge from the trees $CT_j(\ell)$, and "inhibit" all other elements. The active elements are those whose rank $C_i(h)$ has the $\mu$ most significant bits agreeing with the column number $j$ of the merging module. Next, we rearrange the active elements in $M_{ij}$ so that $s_i(h)$ is sent to $M_{ij}(0,\ell)$, with $\ell = C_i(h) \bmod t$.
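The split of a rank into a column-tree index and a position is a plain bit-field extraction. The sketch below uses hypothetical helper names (`output_port`, `is_active`) and assumes ranks are nonnegative integers below $N = 2^{\mu+\tau}$.

```python
def output_port(rank, tau):
    """Return (j, l) such that an element of this rank leaves through CT_j(l)."""
    t = 1 << tau
    j = rank >> tau       # the mu most significant bits
    l = rank & (t - 1)    # the tau least significant bits, i.e. rank mod t
    return j, l


def is_active(rank, tau, module_column):
    """True if an element of this rank is active in a module of column j."""
    return (rank >> tau) == module_column
```

For example, with $t = 4$ ($\tau = 2$) an element of rank 13 is active only in the modules of column 3 and is routed to position 1 there, since $13 = 3 \cdot 4 + 1$.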


This operation is essentially a permutation of the active (and non-active) elements, and can be done by using the CCC as an emulator of the Beneš network. The setting of the switches, although nontrivial, is greatly simplified with respect to the general case by the fact that the active elements do not change their relative order. The desired rearrangement can be done by using the idea of concentration introduced in [11], and expansion, which can be viewed as the inverse of concentration. If $k$ elements are active in a given module, they are first sent to the $k$ leftmost columns of the CCC (concentration), and then routed to the destination columns (expansion).

A straightforward adaptation of the algorithm proposed in [11] for concentration in the cube machine shows that an ASCEND and a DESCEND phase are all that is required to rearrange data on our CCC. Some bits required to set the switches must be precomputed. This task could be performed by the CCC or, to keep the micromodule structure as simple as possible, it can be assigned to a binary tree of full adders whose leaves would be contained in the interface between the CCC and the row trees.

During the entire rearrangement task, computation takes place only in the left half of the CCC without using dimension $E_{\tau}$. We then transfer each active element from $M_{ij}(0,\ell)$ to $M_{ij}(0,t+\ell)$, with a straightforward use of dimension $E_{\tau}$.

At this point element $s(j2^{\tau}+\ell)$ is in $M_{ij}(0,t+\ell)$ (where the value of $i$ is determined by the input sequence to which $s(j2^{\tau}+\ell)$ originally belongs), and is ready to be transmitted to the root of $CT_j(\ell)$, where it is output.

4.1. Performance Analysis and Modification of the Network

The entire machine, even when not explicitly said, is intended to work in bit-serial mode. Both the CCC's and the trees work in a pipelined fashion. Thus any operation takes essentially time proportional to the sum of the operand length and the pipe depth. For the CCC's the depth is $\tau+1$. The operands to be handled have length $q$ when they are input words or $\tau+1$ when they are partial ranks. Since a constant number of ASCEND and DESCEND algorithms are executed, we conclude that $O(\tau+q)$ total time is spent in the CCC's. For the trees the depth is $\mu+1$. The operands to be handled have length $q$ when they are input words, or $\tau+\mu$ when they are total ranks. Since a constant number of fan-in and fan-out algorithms are executed, we conclude that $O(\tau+\mu+q)$ total time is spent in the trees. Thus the time spent in the interconnecting trees dominates that spent in the CCC's, and we reach the conclusion that the $(2^{\mu}, 2^{\tau})$-COMBINER of elements of $q$ bits works in time $T = O(\tau+\mu+q)$.

So far, all the parameters $\tau$, $\mu$, and $q$ have been regarded as independent of each other. We now make an interesting observation. When $q = \Omega(2^{\mu})$, then $T = O(\tau+q)$. In this case the time performance of the trees is not substantially degraded if we realize them as comb-trees, rather than as complete binary trees. The depth will go from $\mu$ to $2^{\mu}$, but this is tolerable in time since $2^{\mu} = O(q)$. On the other hand, comb-trees can be laid out in constant rather than logarithmic width, thus yielding a saving in area. The modified $(2^{\mu}, 2^{\tau})$-COMBINER of words of length $q = \Omega(2^{\mu})$ has then $T = O(\tau+q) = O(\log N + q)$ and $A = O(2^{2(\mu+\tau)}) = O(N^2)$.

4.2. Summary of Symbols and Results for an (m,t)-COMBINER

Sizes: $m = 2^{\mu}$, $t = 2^{\tau}$, $N = mt$, $q$ = wordlength.

Input sequences: $S_i = (s_i(0), s_i(1), \ldots, s_i(t-1))$, $i = 0, 1, \ldots, m-1$.

Output sequence: $S = (s(0), s(1), \ldots, s(N-1))$.

Merging modules: $(\tau+1, 2^{\tau+1})$-CCC's $M_{ij}$, $i, j = 0, 1, \ldots, m-1$.

$M_{ij}(h,k)$, $0 \le h < \tau+1$, $0 \le k < 2t$: micromodules of $M_{ij}$.

Row-trees and column-trees: $RT_i(\ell)$, $CT_j(\ell)$; $0 \le i, j \le m-1$, $0 \le \ell \le t-1$.

Performance:

  Machine                                A                  T
  Full-tree version                      $O(N^2 \mu^2)$     $O(\tau+\mu+q)$
  Comb-tree version, $q = \Omega(m)$     $O(N^2)$           $O(\tau+q)$



5. An Architecture for COMBINATION-SORT

We shall now use the COMBINER developed in the two preceding sections to construct a general network for COMBINATION-SORT. As an intermediate step in the construction, we introduce a new operation called COALESCENCE. Given a collection of $n$ elements, partitioned into $n/t_{i-1}$ sorted subsequences each containing $t_{i-1}$ elements, and given a multiple $t_i$ of $t_{i-1}$ which is also a divisor of $n$, we call $(n; t_{i-1} : t_i)$-COALESCENCE the operation of combining (in the sense defined earlier) consecutive blocks of $t_i/t_{i-1}$ subsequences.

If we refer to the tree of Figure 1, we can easily see that each level of the tree corresponds to a coalescence of the input sequence. If we call COALESCER a network that performs a coalescence, we can build a COMBINATION-SORTER by cascading a suitable set of coalescers, as shown in Figure 5.

Figure 5. COMBINATION-SORTER as a cascade of COALESCERS.

5.1. The COALESCER

An $(n; t_{i-1} : t_i)$-COALESCER can be easily constructed by using $n_i = n/t_i$ $(m_i, t_{i-1})$-COMBINERS. Let us assume, for simplicity, that $n_i$ is a perfect square. We can then lay out the combiners in a $\sqrt{n_i} \times \sqrt{n_i}$ array with input and output lines running in a chosen direction, say, parallel to the rows. An example with $n_i = 4$ is shown in Figure 6.

Figure 6. Layout of an $(n; t_{i-1} : t_i)$-COALESCER with $n_i = n/t_i$ $(m_i = t_i/t_{i-1}, t_{i-1})$-COMBINERS.

We now estimate the area of the COALESCER. We first assume the use of full-tree COMBINERS, so that the side of the COMBINER has a length of $O(t_i \log m_i)$. For the layout shown in Figure 6, we then have:

height $= O(\sqrt{n_i}\, t_i \log m_i + n_i t_i) = O\bigl(n (1 + \log m_i / \sqrt{n_i})\bigr)$

width $= O(\sqrt{n_i}\, t_i \log m_i) = O\bigl(n \log m_i / \sqrt{n_i}\bigr)$


If instead we use comb-tree COMBINERS, the size becomes

height $= O(n)$

width $= O\bigl(n \log m_i / \sqrt{n_i}\bigr)$


The computation time is $T_f = O(\tau + q + \log m_i)$ for the full-tree COALESCER, and $T_c = O(\tau + q + m_i)$ for the comb-tree COALESCER. When $q = O(\log n)$, then $T_f = O(\log n)$. If, in addition, $m_i = O(\log n)$, then $T_c = O(\log n)$.

5.2.  An  Optimal  VLSI  Sorter 

From  the  previous  considerations  it  is  easy  to  see  that  we  can  obtain 

a  VLSI  implementation  of  COMBINATION- SORTERS  by  suitable  use  of  COALESCERS. 

It  should  also  be  easy  to  compute  time  and  area,  once  the  factorization 

n  =  m^m^...md  for  the  algorithm  is  chosen. 

We now show that there is a COMBINATION-SORTER for words of length
$q = O(\log n)$ that sorts $n$ elements in time $T = O(\log n)$ and area
$A = O(n^2)$, thus achieving the known lower bound for this problem. The sorter
we propose is given by the block diagram in Figure 7.


Figure 7. A COMBINATION-SORTER with three COALESCERS, for optimal VLSI sorting.

From the general analysis we easily see that the coalescers take area
(width × height) $O(n) \times O(n)$, $O(n \log\log n/\sqrt{\log n}) \times O(n)$,
and $O(n) \times O(n)$, respectively. It is also clear that the total time is
$O(\log n)$; thus our claim is proved.

6. Area-Time Trade-Off

The COMBINATION-sorter proposed in this paper has optimal area among sorters
that achieve minimum computation time. It is now interesting to ask whether
we can trade time for area, and build a slower but smaller sorter with optimal
$AT^2 = \Theta(n^2 \log^2 n)$.

Since, as we already recalled in the Introduction, area-time optimal
circuits for sorting can be built when
$T \in [\Omega(\log^3 n), O(\sqrt{n \log n})]$ (for a $(1+\epsilon)\log n$
wordlength), the range of computation times for which no optimal circuit
is known yet is $[\Omega(\log n), O(\log^3 n)]$.

We will now describe a network which, by choosing an appropriate value for
a design parameter, allows us to sort in area $A = O(n^2 \log^2 n/T^2)$ for any
time $T \in [\Omega(\log n), O(\sqrt{n \log n})]$. The network is the cascade
interconnection of two components. The first component is a COMBINATION-sorter
for $n/s$ inputs. The second component is a new general architecture, called the
mesh-of-CCC (MCCC) and obtained by suitably "hybridizing" known networks (the
mesh and the CCC).

This architecture will now be described in detail.

An $(n,s)$-MCCC, with $n = 2^{\nu}$, $s = 2^{\sigma}$, and
$r = n/s^2 = 2^{\rho}$ ($\rho = \nu - 2\sigma$), consists of $s^2$ CCC modules,
each with $r$ cycles of length $\rho$. The $n\rho$ processing elements of the
MCCC are conveniently indexed as

$$M_{ij}(h,k): \quad 0 \le i,j < s, \quad 0 \le h < \rho, \quad 0 \le k < r.$$

For a fixed $(i,j)$ pair the set $\{M_{ij}(h,k): 0 \le h < \rho,\ 0 \le k < r\}$
is connected as a CCC module, exactly as described in Section 3. The CCC modules
are arranged as an $s \times s$ mesh, and, for a fixed $k$, the set of
micromodules $\{M_{ij}(0,k): 0 \le i,j < s\}$ is mesh-connected (with $i$ and
$j$ as row and column indices respectively).

The MCCC closely resembles the COMBINER architecture defined in Section 4,
more specifically, the version with comb-trees. In fact the MCCC could be
obtained from the comb-tree connected CCC's by identifying in all CCC's
micromodules M_(h,k) and M_(h,k+t) (with $0 \le k < t$), and deleting the edges
related to $E_t$.
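As an illustration of the indexing of the $(n,s)$-MCCC, the following minimal Python sketch enumerates the micromodule indices and the mesh neighbours of a module. The function names are hypothetical, and the mesh is assumed to have no wraparound links, which the text does not state explicitly.

```python
from itertools import product


def mccc_parameters(nu, sigma):
    """Return (n, s, r, rho) for an (n, s)-MCCC with n = 2**nu and s = 2**sigma."""
    rho = nu - 2 * sigma
    if sigma < 0 or rho < 1:
        raise ValueError("need sigma >= 0 and rho = nu - 2*sigma >= 1")
    return 2 ** nu, 2 ** sigma, 2 ** rho, rho


def micromodules(nu, sigma):
    """Iterate over the index tuples (i, j, h, k) of all micromodules M_ij(h, k)."""
    _, s, r, rho = mccc_parameters(nu, sigma)
    return product(range(s), range(s), range(rho), range(r))


def mesh_neighbors(i, j, s):
    """Modules adjacent to module (i, j) in the s x s mesh.

    Assumption: plain mesh without wraparound. For every k, micromodule
    M_ij(0, k) is linked to M_ab(0, k) for each neighbour (a, b).
    """
    candidates = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(a, b) for a, b in candidates if 0 <= a < s and 0 <= b < s]
```

For example, `mccc_parameters(6, 1)` describes a 2 × 2 mesh of CCC modules, each with 16 cycles of length 4, so the enumeration yields $n\rho = 64 \cdot 4$ index tuples.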

The mesh of CCC's is a very interesting network in its own right, and we
shall now show how it can: (i) emulate the ASCEND (or DESCEND) paradigm [10]
of the Binary-Cube in optimal $AT^2 = O(n^2 \log^2 n)$ for any computation time
$T \in [\Omega(\log^2 n), O(\sqrt{n \log n})]$; (ii) emulate the SORTING
paradigm [3] in optimal $AT^2 = O(n^2 \log^2 n)$ for any computation time
$T \in [\Omega(\log^3 n), O(\sqrt{n \log n})]$. (Recall that we are referring to
input words of length $\Theta(\log n)$, and to a word-serial mode of operation.)

If we consider a $\nu$-dimensional binary cube whose processors are
$P_0, P_1, \ldots, P_{n-1}$ ($n = 2^{\nu}$), we can establish the following
correspondence between MCCC micromodules and cube processors:

$$M_{ij}(0,k) \leftrightarrow P_t, \qquad t = 2^{\rho} j + 2^{\rho+\sigma} i + k.$$

Then it is easy to see that dimensions $E_0, \ldots, E_{\rho-1}$ of the cube are
assigned to the CCC modules, dimensions $E_{\rho}, \ldots, E_{\rho+\sigma-1}$
are assigned to the mesh columns, and finally dimensions
$E_{\rho+\sigma}, \ldots, E_{\rho+2\sigma-1}$ are assigned to the mesh rows.
Thus, by application of well known techniques for emulating the cube with a CCC
or a linear array [10], an ASCEND (or DESCEND) algorithm can be executed in
$O(\rho + s)$ word-steps.
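The correspondence between MCCC micromodules and cube processors can be checked mechanically. The sketch below is a minimal Python illustration, assuming the index formula $t = 2^{\rho} j + 2^{\rho+\sigma} i + k$; the function names are hypothetical.

```python
def cube_index(i, j, k, sigma, rho):
    """Cube processor index t for micromodule M_ij(0, k).

    The low rho bits hold k, the next sigma bits hold the column index j,
    and the top sigma bits hold the row index i.
    """
    return k + (j << rho) + (i << (rho + sigma))


def dimension_owner(d, sigma, rho):
    """Say which part of the MCCC handles cube dimension E_d."""
    if 0 <= d < rho:
        return "ccc"
    if d < rho + sigma:
        return "columns"
    if d < rho + 2 * sigma:
        return "rows"
    raise ValueError("dimension out of range")
```

With $\sigma = 2$ and $\rho = 3$, the map sends the $s^2 r = 128$ triples $(i,j,k)$ onto the indices $0, \ldots, 127$ without collisions, and flipping bit $b$ of $j$ changes $t$ by exactly $2^{\rho+b}$.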

On the other hand the MCCC can be trivially laid out in a square of area
$O(n^2/s^2)$, since each CCC requires $O(n^2/s^4)$ area and channels of
$O(n/s^2)$ width allow a straightforward implementation of mesh-connections.
In conclusion, for $s$ in the range $[\Omega(\log n), O(\sqrt{n/\log n})]$,
considering that $\rho = O(\log n)$ and that a word step takes $O(\log n)$ time,
we obtain $T = O(s \log n)$ and $A = O(n^2/s^2)$, which gives an optimal $AT^2$.
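As a worked check of this claim, taking the bounds $T = O(s \log n)$ and $A = O(n^2/s^2)$ as given, the parameter $s$ cancels in the product $AT^2$, and the stated range of $s$ translates directly into the range of $T$:

```latex
\begin{align*}
A\,T^2 &= O\!\left(\frac{n^2}{s^2}\right) \cdot O\!\left(s^2 \log^2 n\right)
        = O\!\left(n^2 \log^2 n\right), \\
s = \Omega(\log n) &\;\Rightarrow\; T = s \log n = \Omega(\log^2 n), \\
s = O\!\left(\sqrt{n/\log n}\right) &\;\Rightarrow\;
  T = s \log n = O\!\left(\sqrt{n \log n}\right).
\end{align*}
```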


The MCCC, used in the way just described, would not be optimal for the
execution of bitonic sorting. Bitonic sorting of $n = 2^{\nu}$ elements consists
of $\nu$ merging phases, with phase $i$ ($i = 0, \ldots, \nu-1$) performing the
merging of pairs of sequences of length $2^i$, and requiring on the cube the
successive use of dimensions $E_i, E_{i-1}, \ldots, E_0$. So, the schedule of
use of the dimensions for a complete sorting is

$$E_0;\ \ E_1, E_0;\ \ \ldots;\ \ \underbrace{E_{\nu-1}, E_{\nu-2}, \ldots, E_0}_{\text{phase } \nu-1}$$

For brevity, we shall call this schedule [3] the sorting paradigm. On the MCCC
the sorting paradigm requires $O(\rho \log n + s \log s)$ word steps, more than
we desire.
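The sorting paradigm can be made concrete with a short simulation. The following Python sketch generates the schedule of dimensions and runs the textbook bitonic sort over it on a simulated binary cube, one compare-exchange per pair of neighbours along each dimension. It is an illustration of the schedule only, not of the MCCC emulation; the function names are hypothetical, and phases are numbered $0, \ldots, \nu-1$ as an indexing assumption.

```python
def sorting_paradigm(nu):
    """Schedule of cube dimensions: phase i uses E_i, E_(i-1), ..., E_0."""
    return [list(range(i, -1, -1)) for i in range(nu)]


def bitonic_sort_on_cube(values):
    """Sort a list of length 2**nu by following the sorting paradigm."""
    n = len(values)
    nu = n.bit_length() - 1
    if n == 0 or (1 << nu) != n:
        raise ValueError("length must be a power of two")
    a = list(values)
    for phase, dims in enumerate(sorting_paradigm(nu)):
        for d in dims:
            for p in range(n):
                q = p ^ (1 << d)  # neighbour of p along dimension E_d
                if p < q:
                    # Blocks of size 2**(phase+1) alternate between
                    # ascending and descending order.
                    ascending = ((p >> (phase + 1)) & 1) == 0
                    if (a[p] > a[q]) == ascending:
                        a[p], a[q] = a[q], a[p]
    return a
```

The schedule has $\nu(\nu+1)/2$ steps in total, which is the source of the $\log n$ and $\log s$ factors in the word-step count above.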

We can eliminate the $\log s$ factor by a technique (already successful in the
construction of the Pleated-CCC) consisting of an alternating assignment of the
$2\sigma$ topmost dimensions to columns and rows. More precisely, if

$$i = \sum_{l=0}^{\sigma-1} i_l 2^l, \qquad j = \sum_{l=0}^{\sigma-1} j_l 2^l, \qquad t' = \sum_{l=0}^{\sigma-1} (2 i_l + j_l)\, 2^{2l},$$

then we establish between MCCC micromodules and cube processors the
correspondence

$$M_{ij}(0,k) \leftrightarrow P_c, \qquad c = t' \frac{n}{s^2} + k.$$

For this correspondence a simple argument (similar to one given in [3])
shows that only $O(\rho \log n + s)$ word-steps are required for execution of
the sorting paradigm.
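The alternating assignment amounts to interleaving the bits of the row and column indices. A minimal Python sketch, assuming $t' = \sum_l (2 i_l + j_l) 2^{2l}$ and $c = t' \cdot n/s^2 + k$ with $n/s^2 = 2^{\rho}$ (the function names are hypothetical):

```python
def interleaved_index(i, j, sigma):
    """Compute t' by interleaving the bits of i (rows) and j (columns).

    Bit l of j lands at position 2l of t', bit l of i at position 2l + 1.
    """
    t = 0
    for l in range(sigma):
        i_l = (i >> l) & 1
        j_l = (j >> l) & 1
        t += (2 * i_l + j_l) << (2 * l)
    return t


def alternating_cube_index(i, j, k, sigma, rho):
    """Cube processor index c = t' * 2**rho + k for micromodule M_ij(0, k)."""
    return interleaved_index(i, j, sigma) * (1 << rho) + k
```

Under this map, cube dimension $E_{\rho+2l}$ is handled along the mesh columns and $E_{\rho+2l+1}$ along the mesh rows, so consecutive topmost dimensions alternate between the two mesh directions.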

Again, for $s \in [\Omega(\log^2 n), O(\sqrt{n/\log n})]$, recalling that
$O(\log n)$ time is used for a word step, we obtain $T = O(s \log n)$ and
$A = O(n^2/s^2)$. Although our main purpose in defining the MCCC is to construct
optimal sorters for $T \in [\Omega(\log n), O(\log^3 n)]$, we have seen that the
MCCC is an optimal emulator of the cube for both the ASCEND and the SORTING
paradigms.

Let us also point out another interesting feature of the MCCC, namely that
the maximum edge-length in the layout is $O(n/s^2)$. For $s = \Theta(\log n)$ we
obtain a maximum edge-length of $O(n/\log^2 n)$, which is optimal (3). In fact
[12], max edge-length $= \Omega(\sqrt{\text{optimal area}}/\text{diameter})$ for
any graph, and for the MCCC optimal area $= \Theta(n^2/\log^2 n)$ and
diameter $= \Theta(\log n)$. It is also interesting to recall that the optimal
layouts known for the CCC and the Shuffle-Exchange contain edges of length
$O(n/\log n)$.
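Spelled out, and taking the area and diameter figures quoted above for $s = \Theta(\log n)$ as given, the lower bound of [12] matches the upper bound $O(n/s^2)$:

```latex
\begin{align*}
\text{max edge-length}
  &= \Omega\!\left(\frac{\sqrt{\text{optimal area}}}{\text{diameter}}\right)
   = \Omega\!\left(\frac{\sqrt{n^2/\log^2 n}}{\log n}\right)
   = \Omega\!\left(\frac{n}{\log^2 n}\right), \\
\text{max edge-length}
  &= O\!\left(\frac{n}{s^2}\right)
   = O\!\left(\frac{n}{\log^2 n}\right)
   \quad \text{for } s = \Theta(\log n).
\end{align*}
```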

To obtain networks faster than the MCCC we start from the following
observation. A COMBINE-sorter with $n/s$ inputs can sort (in time
$O(s \log n)$ and area $O(n^2/s^2)$) $s$ sequences of $n/s$ elements each. These
sequences can then be fed, say one per column, into an MCCC with parameter $s$.
The sequence in each CCC module is at this point already sorted, and the MCCC is
ready (after inverting the order of some sequences to comply with bitonic
sorting rules) to execute the last $2\sigma$ merging phases. (For the sake of
simplicity we will ignore the fact that only $\sigma$ phases would be really
necessary after the work done by the COMBINE-sorter.) A simple analysis allows
us to conclude that, in the process, the MCCC executes $O(\log s + s)$ steps
using $O(\log n)$ time each, thus running for a total time $T = O(s \log n)$.

In conclusion, when $s \in [\Omega(1), O(\log^2 n)]$, the computation time $T$
of the entire machine ranges in $[\Omega(\log n), O(\log^3 n)]$, and for each
$T$ the layout area is optimally $\Theta(n^2 \log^2 n/T^2)$.
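To make the role of the design parameter concrete, the following Python sketch picks $s$ for a target time $T$ and reports the resulting area and the network that covers that range. It is only an illustration under strong assumptions: every hidden constant in the $O(\cdot)$ bounds is set to 1, logarithms are base 2, $s$ is treated as a real number rather than a power of two, and the function name and the labels `"cascade"` and `"mccc"` are hypothetical.

```python
import math


def choose_design(n, T):
    """Pick s from T = s * log2(n) and return (s, area, network).

    Assumptions: all asymptotic constants equal 1, A = (n / s)**2, and the
    cascade of COMBINE-sorter and MCCC is used while s <= log2(n)**2.
    """
    log_n = math.log2(n)
    s = T / log_n
    if s < 1 or s > math.sqrt(n / log_n):
        raise ValueError("T is outside the range covered by this design")
    network = "cascade" if s <= log_n ** 2 else "mccc"
    area = (n / s) ** 2
    return s, area, network
```

For instance, with $n = 2^{40}$ the call `choose_design(2**40, 40)` selects $s = 1$ (the fastest design, area $n^2$), while `choose_design(2**40, 80000)` selects $s = 2000$, which lies beyond $\log^2 n = 1600$ and so falls in the range served by the MCCC alone. In both cases the product of area and $T^2$ equals $n^2 \log^2 n$ under these assumptions.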


(3) This property has been noted also by A. Aggarwal for an architecture very
similar to the MCCC (private communication).